Part I: Data Wrangling (Michigan Exit Poll Data)

We Need a Horse, Lasso

Data Wrangling: The process of getting the data ready for analysis, including: accessing data, reformatting in appropriate ways (format, orientation), creating variables, recoding values, and selecting variables and/or observations of interest for analysis.

How you process and recode data is critical! Principles that should guide us include:

We want to do things that are understandable and sensible. Always imagine that you are handing the code off to someone else – can they understand what you are doing? Can they follow what you have done to the data? The goal in data wrangling is to provide a set of steps that will take you from the raw data to the data being analyzed in clear and understandable steps so that others can see exactly the choices that you have made (and make other choices if desired to see the robustness of your inferences).

Data wrangling in R gives us the ability to do this because we can document and explain the exact steps we are taking to organize the data. Using code to make changes means anyone and everyone can observe what we did.

Don’t Be This Guy

In contrast, if you make changes to your data in a spreadsheet like Excel or Google Sheets or Numbers it can be impossible for someone else to figure out what you did. What variables did you rename, what transformations did you do, how did you recode values?

People using Excel for analysis have been identified as being (partially) responsible for what is widely believed to be a failed global economic policy.

Source of many, many errors: World Economy

And using Excel for data entry has resulted in well-documented errors in Medical Research due to “auto-correct” changing value names.

Source of many, many errors: Medical Research

In fact, Dr. Jennifer Byrne - named one of the top 10 people in science in 2017 because of her data sleuthing - has developed an entire program to find and correct errors produced by data entry in Excel! Link here.

Dr. Byrne, Cancer Researcher, Data Detective

So the goal of this topic is to give you some basic tools to be able to manipulate data and get it ready for analysis. This includes inspecting your data and getting a sense of the type of data you have.

RECAP: Starting out

When you start out (in this class) it is always a good idea to have a clean – i.e., empty – workspace so that your analysis is based on the code that you are running rather than on manipulations that were done previously. Do not be one of those students at the end of the class who has every dataset they worked with saved and loaded in their Environment.

If you are starting out with an empty workspace your RStudio should say the “Environment is empty” in the Environment tab. It is always a good idea to start with an empty environment so that the data manipulations we are doing are not cumulative across sessions.

Starting Out

To that end, when you quit RStudio, DO NOT save the workspace image. This will ensure that you are always starting with a clean and empty workspace, which means there is never a possibility of using “old” data in your analyses. If you are working on a larger project across several sessions it sometimes makes sense to save it, but for this class you should get in the practice of writing self-contained code that reads and wrangles the data you need to analyze every time rather than relying on earlier work.

DO NOT SAVE YOUR WORKSPACE WHEN YOU EXIT

When starting out it can be easy to forget that R needs to be told where to look to find things!

  1. For every assignment/lecture create a new folder on your computer (e.g., “Topic 3”).

  2. Download the code and data into this folder. (Or move them from your Downloads folder to that folder.) Or create a new RMarkdown file and save the code to this folder.

  3. In the Console window of RStudio figure out where R thinks it is by looking at the title bar or using getwd (get working directory)!

getwd()
## [1] "/Users/rweldziu/Library/CloudStorage/Dropbox/Villanova/TEACHING/PSC7000 - Research Concepts and Approaches/PSC7000_F2024/Homeworks"

If this is different from where your code and data are, there will be trouble!

Because everything in R is an object, we can actually save this result as an object and use it!

mydir <- getwd()
  4. Manually set the working directory to be the directory where your code and data are located using setwd (set working directory) and the location we just saved to the object mydir!
setwd(mydir)

We can also use RStudio to accomplish the same task.

  1. Open your code in R-Studio.

  2. Using the GUI to Select Directory
  3. The GUI will automatically change the directory, but you should also copy and paste the resulting code snippet from the Console into your code!

Warning!

What R “sees” when working in the Console window (or using an R Script) is different from what it “sees” using RMarkdown! You can Knit an RMarkdown document to conduct analysis without any object ever appearing in the Global Environment.

Quick Exercise Clear out the Global Environment and Knit this document. Confirm that the code will run and nothing will be created in the Global Environment.

While this makes sense once the code/project is completed, it can make coding more difficult because we often want to be able to manipulate the code and objects while working. To do so we are going to write and evaluate chunks one-at-a-time to ensure the objects are created in the Global Environment.

So we always start our code by:

  1. Loading the tidyverse library and any other libraries we are going to use.
library(tidyverse)
  2. Checking to make sure that R is looking in the right spot to find your data and code. It typically makes sense to use setwd to tell R where in your computer it should be looking for additional data (and code).
getwd()
setwd()
  3. Now get that data! One method is to use the Graphical User Interface (GUI) to select menus that will find and read in the data of interest. If you are getting data from some other format (e.g., data saved in a basic text format (comma or tab delimited) or in another statistical package (e.g., STATA or SAS)) we want to Import Dataset.

RStudio Starting Out

If you are loading in data that is already formatted in R then you go to the RStudio menu: File > Open File.... (Note that this menu also has the File > Import Dataset option.)

Because we want to document what dataset we are using, for replicability we want to copy and paste the resulting code into our RMarkdown file so that next time we can run the code directly!

A second way to access data is to use some functions directly in R. This is what the GUI is actually doing behind the scenes. We are going to focus on three functions in this class:

  - load
  - read_rds
  - url

A tidyverse function we will sometimes use is read_rds (Read R Data Structure). This will load a single object into memory. Here we are reading in a subset of the 2020 Michigan Exit Poll and naming it the tibble mi_ep.

mi_ep <- read_rds("../data/MI2020_ExitPoll_small.rds")

The first part of the code is defining a new object called mi_ep (short for Michigan exit poll) and then assigning (<-) that object to be defined by the dataset being read in by the read_rds function.

To highlight the syntax, consider what happens when you run the following code:

read_rds("../data/MI2020_ExitPoll_small.rds")

If we read in data without assigning it to an object R will simply print the results to the console window. Nothing will be saved and we cannot use the data in future analyses! When reading in an R data object using read_rds we always need to define a new object associated with that file!
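To see this assignment logic in action without the exit-poll file, here is a minimal sketch using base R's readRDS (which read_rds wraps), a temporary file, and the built-in mtcars data as a stand-in:

```r
# Save a single object to a temporary .rds file
tmp <- tempfile(fileext = ".rds")
saveRDS(mtcars, tmp)

readRDS(tmp)             # prints to the console; nothing is saved

my_data <- readRDS(tmp)  # assigned to an object: usable in later analyses
nrow(my_data)            # 32 rows, the same data as mtcars
```

The second call is just like `mi_ep <- read_rds(...)` above: without the assignment arrow, the data appears on screen and is gone.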

Before moving on, let’s clean up our Environment – always a good idea between lectures and assignments! – using either the broom icon in the Environment GUI to delete every object or the rm (remove) command to remove specific objects. Here we can use rm to remove the tibble we just created.

rm(mi_ep)
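If you prefer to empty the whole Environment in code rather than with the broom icon, base R can remove every object at once. A minimal sketch with throwaway objects (note this deletes everything, so use with care):

```r
x <- 1:10      # create a couple of throwaway objects
y <- "hello"

# objects() lists everything in the Environment;
# rm(list = ...) removes all of it at once
rm(list = objects())

objects()      # character(0): the Environment is now empty
```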

A nice thing about R is that it can have multiple objects in the Global Environment at the same time. This allows you to work with multiple datasets and objects. Instead of loading them all into memory one at a time using read_rds, we can also use the load function to load several R objects that have been saved into a single file (using the save function) all at once.

In the file named MI2020_ExitPoll.Rdata I have previously saved the entire 2020 Michigan Exit Poll (the object named MI_final) as well as a subset of the exit poll that focuses on variables we are going to analyze in class (named MI_final_small). To give you a chance to do your own analyses, I have saved both objects together in a single file so you can access either one.

Note that when we are loading an .Rdata file we do not need to assign it to an object because the objects are already defined within the .Rdata object. You can see this when you use the objects function to print to the console window the objects that are loaded into memory (which will also match the Global Environment list when we run the code chunk-by-chunk):

load(file = "../data/MI2020_ExitPoll.Rdata")
objects()
## [1] "MI_final"       "MI_final_small" "mydir"

Relatedly, because we are loading several objects at once the name of the file will not be the same as the name of the objects being loaded.

NOTE: R will look for this file relative to where it is (as given by getwd). Here the data folder sits one level up from the current directory (that is what the ../ in the path means), and the path tells R to look for my data in that folder!
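The save/load round trip can be sketched with hypothetical stand-in objects (the real file instead bundles MI_final and MI_final_small):

```r
height <- c(1.6, 1.7, 1.8)   # two throwaway objects
weight <- c(60, 70, 80)

# Bundle both objects into a single .Rdata file
tmp <- tempfile(fileext = ".Rdata")
save(height, weight, file = tmp)

rm(height, weight)   # remove them from the Environment

# load() restores both under their original names; no assignment needed
load(file = tmp)
objects()            # "height" and "weight" are back
```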

We can also read in non-R files. For example, we can read in comma-delimited files (read.csv), tab-delimited files (read.delim), Excel-created files (read.xls, read.xlsx), and files from other statistical languages (e.g., read.dta, read.sav). Reading in non-R objects/data requires reading in the data and also assigning the data to a new R object.
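As a sketch of reading a non-R file, we can write a tiny comma-delimited file to a temporary location and read it back with read.csv (the column names here are made up for illustration). As with read_rds, the result must be assigned to an object:

```r
# Write a tiny csv to a temporary file to stand in for downloaded data
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,age,vote",
             "1,34,Biden",
             "2,51,Trump"), tmp)

# Read the comma-delimited file and assign it to a new object
survey <- read.csv(tmp)
dim(survey)   # 2 rows, 3 columns
```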

Trouble with tibbles?

Tibble?
dim(MI_final_small)
## [1] 1231   14

In most of the work that we will do in DSCI 1000 rows typically refer to observations – e.g., survey respondents – and columns refer to characteristics of those observations (e.g., survey responses) which we call variables.

We can use matrix indexing to examine/extract specific rows and columns of tibbles/dataframes. To select the first three rows we would use:

MI_final_small[1:3,]
## # A tibble: 3 × 14
##     SEX AGE10 PRSMI20 PARTYID WEIGHT QRACEAI EDUC18  LGBT BRNAGAIN LATINOS
##   <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <dbl>    <dbl>   <dbl>
## 1     2     2       1       3  0.405       1      4    NA       NA       2
## 2     2    10       1       1  1.81        2      1     2        1       2
## 3     2     7       1       1  0.860       1      5     2        2       2
## # ℹ 4 more variables: RACISM20 <dbl>, QLT20 <fct>, preschoice <chr>,
## #   Quality <chr>

To select the first three columns…

MI_final_small[,1:3]
## # A tibble: 1,231 × 3
##      SEX AGE10 PRSMI20
##    <dbl> <dbl>   <dbl>
##  1     2     2       1
##  2     2    10       1
##  3     2     7       1
##  4     1     9       1
##  5     2     8       1
##  6     2     7       1
##  7     1     9       1
##  8     1     8       1
##  9     2     6       2
## 10     1     8       1
## # ℹ 1,221 more rows

Selecting the first two rows and first four columns

MI_final_small[1:2,1:4]
## # A tibble: 2 × 4
##     SEX AGE10 PRSMI20 PARTYID
##   <dbl> <dbl>   <dbl>   <dbl>
## 1     2     2       1       3
## 2     2    10       1       1

Selecting a collection/combination of rows and all columns we would use:

MI_final_small[c(1:2,45,100),]
## # A tibble: 4 × 14
##     SEX AGE10 PRSMI20 PARTYID WEIGHT QRACEAI EDUC18  LGBT BRNAGAIN LATINOS
##   <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <dbl>    <dbl>   <dbl>
## 1     2     2       1       3  0.405       1      4    NA       NA       2
## 2     2    10       1       1  1.81        2      1     2        1       2
## 3     2     7       1       3  0.488       1      3    NA       NA       2
## 4     2    10       2       4  0.299       1      2     2        1       2
## # ℹ 4 more variables: RACISM20 <dbl>, QLT20 <fct>, preschoice <chr>,
## #   Quality <chr>

Outline for the Next Few Lectures

2020 MI Exit Poll

To practice our data wrangling we are going to use the 2020 National Exit Poll from the National Election Pool for the state of Michigan.

Exit Polls in US Elections

Exit polls are polls that are conducted on (or immediately before) election day in the United States. They are used to help interpret the meaning of an election – i.e., why voters voted the way they did, whether some types of voters were more likely to support one candidate over the other, etc. – but intelligent prognosticators do not use them to actually project who is going to win. Put differently, they are used to help interpret an election, but they are not used to predict it (which is what voting data is used for).

Filling Out an Exit Poll

In class we are going to focus on an Exit Poll from the National Election Pool for the state of Michigan. Michigan is increasingly thought of as a “swing state” that could be won by either a Democrat or a Republican in a presidential contest following its surprising support for President Trump in the 2016 presidential election. Michigan is also a rather diverse state in terms of its population and interests, and some have worked to identify groups of like-minded voters within the state. (Note that we will also do this when we get to the “clustering” topic!)

Why Michigan?

The Exit Poll data we will analyze is based on the following questionnaire, which reports the precise questions that were asked as well as the value labels associated with each response. So while the data that we read in will have responses coded as a “1” or “3” or “4”, interpreting the meaning of those values requires comparing the values to the questions below.

2020 MI Exit Poll

Lots of interesting questions!

Predictive: Use the data to predict an outcome of interest.

Descriptive: Use the data to describe an event.

THINKING: Is any of it causal? Can we determine what causes a voter to support a candidate from this data?

Whenever we do anything using data we should first inspect our data to make sure it makes sense. The function glimpse gives us a quick summary by printing the first few observations of every variable to the screen. Note that the presentation of the data is flipped (technically transposed) so that the columns are presented as rows.

glimpse(MI_final_small)
## Rows: 1,231
## Columns: 14
## $ SEX        <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2,…
## $ AGE10      <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 6, 8, 9, 10, 1, 5, 9, 10, 8, 4, 1,…
## $ PRSMI20    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1,…
## $ PARTYID    <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1,…
## $ WEIGHT     <dbl> 0.4045421, 1.8052619, 0.8601966, 0.1991648, 0.1772090, 0.49…
## $ QRACEAI    <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 9, 1, 1, 1, 1, 1, 3, 1, 1, 1,…
## $ EDUC18     <dbl> 4, 1, 5, 4, 5, 3, 3, 3, 4, 4, 5, 5, 4, 1, 1, 1, 5, 2, 4, 2,…
## $ LGBT       <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, NA, NA, 2, NA, 2, 2,…
## $ BRNAGAIN   <dbl> NA, 1, 2, NA, NA, 2, 1, 2, NA, NA, NA, NA, NA, 2, NA, 2, 1,…
## $ LATINOS    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,…
## $ RACISM20   <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, NA, NA, 2, NA, 9, 4,…
## $ QLT20      <fct> Has good judgment, NA, NA, Has good judgment, Cares about p…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality    <chr> "Has good judgment", NA, NA, "Has good judgment", "Cares ab…

This is a useful representation as we can immediately see what variables we have as well as what some of the values are. Here, for example, we see that we have missing data – denoted as NA in R - in quite a few variables. We also see that some variables have numbers for values while others have strings of letters (e.g., QLT20, preschoice).

This variation highlights the fact that there are several types of data:

  1. <dbl> Double. “Numbers as a number.” Numbers stored to a high level of scientific precision. Mathematical operations are defined. (At least in theory!) e.g., SEX

  2. <int> Integer. “Numbers as a number.” Mathematical operations are defined. (At least in theory!) R treats <dbl> and <int> as largely interchangeable.

  3. <chr> Character. A variable with letter and/or number values. Mathematical operations are not defined, but other functions exist (e.g., extract the first and last characters, etc.) e.g., preschoice

  4. <fct> Factor. A variable defining group membership. Mathematical operations are not defined, but they can be used in special ways in R. e.g., QLT20. Note how the values of a character variable like preschoice are in quotes while the values of a factor variable like QLT20 are not. Factor variables can “look” like numeric or character variables in that the values they take on may be numbers or letters (or both), but R interprets them differently than numeric and character variables and they can be used to do special things.

NOTE: There are also list objects, but we will cover them when needed.
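A quick way to see these types in action is to build one small vector of each kind (toy values, not exit-poll data) and ask R how it classifies them:

```r
num <- c(1, 2, 3)                      # double: math is defined
chr <- c("Biden", "Trump")             # character: math is not defined
fct <- factor(c("Dem", "Rep", "Dem"))  # factor: group membership

class(num)    # "numeric"
class(chr)    # "character"
class(fct)    # "factor"
levels(fct)   # the groups: "Dem" "Rep"
mean(num)     # 2: mathematical operations work on numbers
```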

A second way to get a sense of the data is to use the summary command which will report the quantiles for any integer or numeric variable, the count of every value for a factor variable, and a note if the variable is a character variable. Unlike glimpse which gives you a rough approximation of how much the data may vary, summary quantifies the variation of each numeric variable in a bit more detail.

summary(MI_final_small)
##       SEX           AGE10           PRSMI20        PARTYID     
##  Min.   :1.00   Min.   : 1.000   Min.   :0.00   Min.   :1.000  
##  1st Qu.:1.00   1st Qu.: 6.000   1st Qu.:1.00   1st Qu.:1.000  
##  Median :2.00   Median : 8.000   Median :1.00   Median :2.000  
##  Mean   :1.53   Mean   : 8.476   Mean   :1.63   Mean   :2.236  
##  3rd Qu.:2.00   3rd Qu.: 9.000   3rd Qu.:2.00   3rd Qu.:3.000  
##  Max.   :2.00   Max.   :99.000   Max.   :9.00   Max.   :9.000  
##                                                                
##      WEIGHT          QRACEAI          EDUC18           LGBT      
##  Min.   :0.1003   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:0.3775   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :0.8020   Median :1.000   Median :3.000   Median :2.000  
##  Mean   :1.0000   Mean   :1.572   Mean   :3.288   Mean   :2.224  
##  3rd Qu.:1.4498   3rd Qu.:1.000   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :5.0853   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##                                                   NA's   :615    
##     BRNAGAIN        LATINOS         RACISM20    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :2.000   Median :2.000  
##  Mean   :1.907   Mean   :2.175   Mean   :2.325  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000  
##  NA's   :615                     NA's   :615    
##                              QLT20      preschoice          Quality         
##  [DON'T READ] Don’t know/refused: 26   Length:1231        Length:1231       
##  Can unite the country          :125   Class :character   Class :character  
##  Cares about people like me     :121   Mode  :character   Mode  :character  
##  Has good judgment              :205                                        
##  Is a strong leader             :138                                        
##  NA's                           :616                                        
## 

This is useful because it can reveal potential issues that we need to deal with. If we look at the maximum value for AGE10, for example, we can see that it has a value of 99 even though the Exit Poll questionnaire indicates that the largest value should only be a 10 (for those selecting “75 and over”). This is because respondents who skipped that question were coded as having a value of 99. We also see that in other variables (e.g., PARTYID) there are values of 9 even though there is no value associated with that in the questionnaire. These are again missing data!

Recall that this is the actual data that was used, and we see right away that while some variables have explicit missing data (e.g., LGBT, BRNAGAIN, RACISM20, QLT20), others have missing data that is not recognized as missing by R because of how it was coded. Moreover, the code for missing data varies between variables in this case! A 9 indicates missing data in PARTYID but it means something real in AGE10 (which uses 99 to denote missing data).
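As a preview of the recoding we will do later, here is a minimal base-R sketch, using a hypothetical age vector, that turns the numeric missing-data code 99 into an explicit NA:

```r
# Hypothetical AGE10-style responses; 99 means the respondent skipped
age <- c(2, 10, 7, 99, 8, 99)

age[age == 99] <- NA     # replace the missing-data code with NA

age                      # 2 10 7 NA 8 NA
mean(age, na.rm = TRUE)  # 6.75: the mean of the observed values only
```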

There are several inspection functions to choose from depending on the type of data. For data that takes on discrete values – either numeric or character – it can be helpful to see the distribution of values that occur within a variable. To do so we are going to get a count of each unique value associated with a variable in a tibble.

In the code that follows we are using the tidyverse pipe command %>%, which you should interpret as meaning “and then”. Using piping (%>%), we can count the variable preschoice within the tibble MI_final_small. This is a pattern we will use throughout, piping multiple commands together to accomplish our intended tasks. Note that the default reporting is to arrange the rows from smallest to largest according to the values of the variable being counted (here preschoice, which is being reported in alphabetical order).

MI_final_small %>% 
  count(preschoice)
## # A tibble: 6 × 2
##   preschoice                          n
##   <chr>                           <int>
## 1 Another candidate                  25
## 2 Donald Trump, the Republican      459
## 3 Joe Biden, the Democrat           723
## 4 Refused                            14
## 5 Undecided/Don’t know                4
## 6 Will/Did not vote for president     6

Note that here we are creating a table that is printed to the console window and then disappears.

Quick Exercise Can you create a tibble containing a table of vote choice?

# INSERT CODE HERE

If we count several variables, R will present the count within each combination of values. So the following code will first count each value of preschoice and then count how that breaks down according to the values of the variable SEX.

MI_final_small %>% 
  count(preschoice,SEX)
## # A tibble: 12 × 3
##    preschoice                        SEX     n
##    <chr>                           <dbl> <int>
##  1 Another candidate                   1    17
##  2 Another candidate                   2     8
##  3 Donald Trump, the Republican        1   247
##  4 Donald Trump, the Republican        2   212
##  5 Joe Biden, the Democrat             1   304
##  6 Joe Biden, the Democrat             2   419
##  7 Refused                             1     7
##  8 Refused                             2     7
##  9 Undecided/Don’t know                1     3
## 10 Undecided/Don’t know                2     1
## 11 Will/Did not vote for president     1     1
## 12 Will/Did not vote for president     2     5

This is an important reminder to always inspect your data: you will often, if not always, need to wrangle it before doing analysis!

Quick Exercise How many Democrats and Republicans are in our sample?

# INSERT CODE HERE

Note that count is not useful for many-valued (continuous) variables because each value is likely to occur only a small number of times.

MI_final %>% 
  count(WEIGHT)
## # A tibble: 411 × 2
##    WEIGHT     n
##     <dbl> <int>
##  1  0.100     1
##  2  0.113     1
##  3  0.119     1
##  4  0.133     2
##  5  0.141     1
##  6  0.142     1
##  7  0.144     1
##  8  0.146     1
##  9  0.147     1
## 10  0.149     5
## # ℹ 401 more rows

Here the summary function is more useful for describing the variation in our data. In addition to summarizing the entire dataset we can also select a specific variable and use summary.

MI_final %>% 
  select(WEIGHT) %>%
  summary(WEIGHT) 
##      WEIGHT      
##  Min.   :0.1003  
##  1st Qu.:0.3775  
##  Median :0.8020  
##  Mean   :1.0000  
##  3rd Qu.:1.4498  
##  Max.   :5.0853

Selecting Variables and Observations

Once we load a dataset into active memory and take a look to get a sense of what we have we often want to be able to focus on particular variables or observations. Not every variable or observation will be valuable for the analyses that we do and we want the ability to extract the relevant data either for future analyses (e.g., creating a new tibble with just the relevant data) or for the analysis that we want to do right now (e.g., perform a calculation using a specific variable in the tibble on a subset of the observations). If we are interested in the how respondents who self-identify as “born again” reported voting in 2020 in Michigan, for example, we only need to analyze the presidential vote choice of born-again voters.

For the data in this class, columns are variables and rows are observations. That is, each row is a unique data point and each column is a description of one feature of that data point. When doing analysis we often want to focus on observations with particular features – e.g., voters from a particular state in a nationwide survey.

Selecting variables (columns) using select

To begin, let’s create a new tibble called MI_small by extracting/selecting 4 variables from MI_final_small. If we have a large dataset it can be useful to create a smaller dataset to save computer memory and speed processing time. This is not a concern for any of the data we will use, but it is a useful illustration. Beyond creating new tibbles that are a subset of existing tibbles, the other use of select is to extract particular variables for analysis within the tibble.

Recall that the pipe command %>% should be interpreted as meaning “and then”. In English we would read this code as: “MI_small is defined to be MI_final_small and then select the variables SEX, AGE10, PRSMI20, PARTYID and then glimpse the results”.

MI_small <- MI_final_small %>%
  select(SEX,AGE10,PRSMI20,PARTYID) %>%
  glimpse()
## Rows: 1,231
## Columns: 4
## $ SEX     <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1,…
## $ AGE10   <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 6, 8, 9, 10, 1, 5, 9, 10, 8, 4, 1, 8,…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1,…

What this code does is create a new tibble called MI_small that consists of the variables SEX, AGE10, PRSMI20, and PARTYID from the MI_final_small tibble and then prints a glimpse of the new tibble to the screen so we can confirm that it did what we wanted it to do. While MI_final_small had a dimension of 1231, 14, the new tibble MI_small has a dimension of 1231, 4.

We can also drop variables from a tibble by negatively selecting them. To drop AGE10 we just “subtract” the variable from the selection as follows:

MI_small %>% 
  select(-AGE10) %>%
  glimpse()
## Rows: 1,231
## Columns: 3
## $ SEX     <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1,…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1,…

Note that I am piping through the glimpse to confirm that the code is doing what I think it is doing. A large number of coding and analysis mistakes are caused by differences between what the data scientist thinks the data is and what the data actually is, so it is usually a good idea to confirm that you have done what you think you have done.

Quick Exercise Create a new tibble called MI_small_2 that contains the variables EDUC18, PARTYID, and SEX.

# INSERT CODE HERE

We can also select variables based on features of the variable names. To select all variables that start with the letter “P”, for example, we would use:

MI_small %>%
  select(starts_with("P")) %>%
  glimpse()
## Rows: 1,231
## Columns: 2
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1,…

Why would we ever do this? If you are working with a large number of variables it can sometimes be useful to have uniform naming conventions (e.g., have all demographic variables start with the prefix “d_” as in “d_age”).

We can also select variables that end with (or do not end with) a particular set of values. The code below selects all variables that do not – hence the ! that tells R to do the opposite of the function (i.e., “does not”) – end in a 0.

MI_small %>% 
  select(!ends_with("0")) %>%
  glimpse()
## Rows: 1,231
## Columns: 2
## $ SEX     <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1,…

These functions are admittedly a bit specific and we won’t really use them much in class, but it is good to know the functionality exists. Note that we are not limited to a single character, we could select variables that starts_with("PAR") or ends_with("20").

We can also select a range of variables using the sequence notation : (all values between), which takes all variables between the two named variables, including those two variables. For example, to select all variables between SEX and PRSMI20 in the tibble we could do the following.

MI_final_small %>% 
  select(SEX:PRSMI20) %>%
  glimpse()
## Rows: 1,231
## Columns: 3
## $ SEX     <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1,…
## $ AGE10   <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 6, 8, 9, 10, 1, 5, 9, 10, 8, 4, 1, 8,…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1,…

Before reading the next sentence, can you think of why this is not good coding practice? Answer: it makes the code depend on the arrangement of the variables in your data, such that a different arrangement of variables will produce differences in output. We always want our code to be replicable, and it is therefore desirable to use coding practices that are not affected by reorderings that do not change the underlying data.
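To see the problem concretely, consider a toy data frame whose columns get reordered; the same positional/range-style selection now returns different variables even though the underlying data are unchanged:

```r
df <- data.frame(SEX = 1, AGE10 = 5, PRSMI20 = 2)

names(df[, 1:2])         # "SEX" "AGE10"

df2 <- df[, c(3, 1, 2)]  # identical data, columns reordered
names(df2[, 1:2])        # "PRSMI20" "SEX": a different selection!
```

Selecting variables by name avoids this fragility entirely.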

Reordering observations (rows) using arrange

When working with the rows of a tibble it can sometimes be useful to rearrange their order. The exit poll data we are working with, for example, has no natural ordering – the order of the data is the order in which it was collected. But we may want to rearrange the data to sort it in increasing (or decreasing) order according to selected variables. To do so we can use the arrange function to sort the tibble according to the values of the specified variables. To rearrange the MI_small tibble by SEX (from smallest value to largest value) we would use:

MI_final_small %>% 
  arrange(SEX) %>%
  glimpse()
## Rows: 1,231
## Columns: 14
## $ SEX        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ AGE10      <dbl> 9, 9, 8, 8, 9, 10, 5, 9, 8, 4, 1, 1, 6, 7, 7, 4, 99, 9, 10,…
## $ PRSMI20    <dbl> 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 0, 2,…
## $ PARTYID    <dbl> 3, 1, 1, 1, 3, 2, 4, 1, 3, 3, 3, 1, 1, 4, 3, 4, 3, 1, 3, 2,…
## $ WEIGHT     <dbl> 0.1991648, 1.3713049, 1.1540513, 1.3000787, 0.1469731, 0.84…
## $ QRACEAI    <dbl> 1, 1, 1, 2, 9, 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 1, 5, 1, 1, 1,…
## $ EDUC18     <dbl> 4, 3, 3, 4, 5, 5, 1, 1, 5, 2, 4, 1, 2, 5, 2, 2, 9, 2, 5, 1,…
## $ LGBT       <dbl> NA, 2, 2, NA, NA, NA, 2, NA, 2, NA, 2, 1, 2, NA, NA, NA, NA…
## $ BRNAGAIN   <dbl> NA, 1, 2, NA, NA, NA, 2, NA, 1, NA, 2, 2, 2, NA, NA, NA, NA…
## $ LATINOS    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ RACISM20   <dbl> NA, 2, 2, NA, NA, NA, 2, NA, 4, NA, 3, 1, 2, NA, NA, NA, NA…
## $ QLT20      <fct> Has good judgment, NA, NA, Cares about people like me, Is a…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality    <chr> "Has good judgment", NA, NA, "Cares about people like me", …

Note how all of the values for SEX shown by glimpse are now 1 – the tibble has been sorted so that rows with SEX equal to 1 come first. Also note that we have not saved this rearrangement – we have only displayed the rearranged tibble. How would you save the rearrangement?
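One way to answer that question: assign the piped result to a new object with <-. Here is a minimal sketch using a small toy tibble so the snippet is self-contained – the same pattern works on MI_final_small:

```r
library(dplyr)

# A tiny stand-in tibble for illustration
toy <- tibble(SEX   = c(2, 1, 2, 1),
              AGE10 = c(3, 9, 1, 5))

# Assigning the result of the pipe saves the rearranged data
toy_sorted <- toy %>%
  arrange(SEX)

toy_sorted
```

Without the assignment, the sorted tibble is printed and then discarded; with it, toy_sorted can be used in later steps.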

The default is to sort from smallest to largest, but to sort from largest to smallest we need to tell R to take the variable in descending (desc) order as follows.

MI_small %>% 
  arrange(desc(SEX)) %>%
  glimpse()
## Rows: 1,231
## Columns: 4
## $ SEX     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ AGE10   <dbl> 2, 10, 7, 8, 7, 6, 1, 10, 8, 9, 7, 8, 9, 1, 10, 6, 10, 5, 7, 9…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 9,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 2, 4, 1, 1, 3, 2, 1, 1, 1, 1, 1, 3, 3, 1, 3, 4,…

Quick Exercise Sort the new tibble MI_small_2 you created by education level and save the sorted data.

# INSERT CODE HERE

We can also sort using multiple variables. If we arrange using several variables R will process them sequentially — sorting by the first variable, then sorting within each value of the first variable according to the values of the second variable, and so on. So if we wanted to sort by age within gender, both in ascending order, we would use:

MI_small %>% 
  arrange(SEX, AGE10) %>%
  glimpse()
## Rows: 1,231
## Columns: 4
## $ SEX     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ AGE10   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ PRSMI20 <dbl> 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 9, 8, 1,…
## $ PARTYID <dbl> 3, 1, 3, 2, 2, 4, 4, 2, 2, 2, 3, 1, 3, 4, 4, 1, 3, 3, 3, 9, 4,…

So you can now see that the data is sorted by AGE10 within each value of SEX. There is no limit to the number of variables we can sort by, and we can mix ascending and descending order.

As a test of what we have done so far, can you predict what the following code will produce?

MI_small %>% 
  select(SEX,AGE10,PARTYID) %>%
  arrange(SEX, AGE10, desc(PARTYID)) %>%
  glimpse()
## Rows: 1,231
## Columns: 3
## $ SEX     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ AGE10   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ PARTYID <dbl> 4, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 9, 9, 4, 4, 3,…

To be honest, we don’t really do much with arrange other than when we are inspecting the data visually or making the output a bit more sensible.
Recall that the count function produces a tibble that is sorted according to the values of the variable being counted. This is not always sensible, as we may want to sort the tibble according to the most frequently occurring value in the data. If we wanted to produce a table of values sorted in descending order, so that the top row was the most frequently occurring value, we could arrange after piping through a count.1

MI_final_small %>% 
  count(preschoice) %>%
  arrange(desc(n))
## # A tibble: 6 × 2
##   preschoice                          n
##   <chr>                           <int>
## 1 Joe Biden, the Democrat           723
## 2 Donald Trump, the Republican      459
## 3 Another candidate                  25
## 4 Refused                            14
## 5 Will/Did not vote for president     6
## 6 Undecided/Don’t know                4

Selecting observations (rows) using filter

So far all we have done is to rearrange the observations (rows) in a tibble. If we want to extract particular observations then we need to use the filter function.

To select all of the female respondents (i.e., rows where SEX takes on the value 2), we could use filter as follows:

MI_final_small %>%
  filter(SEX == 2) %>%
  glimpse()
## Rows: 652
## Columns: 14
## $ SEX        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ AGE10      <dbl> 2, 10, 7, 8, 7, 6, 1, 10, 8, 9, 7, 8, 9, 1, 10, 6, 10, 5, 7…
## $ PRSMI20    <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ PARTYID    <dbl> 3, 1, 1, 3, 3, 2, 4, 1, 1, 3, 2, 1, 1, 1, 1, 1, 3, 3, 1, 3,…
## $ WEIGHT     <dbl> 0.4045421, 1.8052619, 0.8601966, 0.1772090, 0.4921975, 1.50…
## $ QRACEAI    <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2,…
## $ EDUC18     <dbl> 4, 1, 5, 5, 3, 4, 4, 1, 2, 2, 4, 2, 3, 2, 4, 3, 2, 5, 4, 4,…
## $ LGBT       <dbl> NA, 2, 2, NA, 2, NA, NA, 2, 2, NA, NA, 2, NA, 2, 2, NA, 2, …
## $ BRNAGAIN   <dbl> NA, 1, 2, NA, 2, NA, NA, 2, 2, NA, NA, 2, NA, 2, 1, NA, 9, …
## $ LATINOS    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ RACISM20   <dbl> NA, 2, 2, NA, 2, NA, NA, 9, 2, NA, NA, 3, NA, 1, 2, NA, 2, …
## $ QLT20      <fct> Has good judgment, NA, NA, Cares about people like me, NA, …
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality    <chr> "Has good judgment", NA, NA, "Cares about people like me", …

When we glimpse the results we can see that every value for SEX is a 2 as requested.

Note the syntax: we use == to denote “is equal to” and we write the value 2 without quotes because SEX is a numeric variable.

We can also obviously combine filter with select to extract a subset of variables and observations as follows:

MI_final_small %>%
  select(SEX,PARTYID,preschoice) %>%
  filter(SEX == 2) %>%
  glimpse()
## Rows: 652
## Columns: 3
## $ SEX        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ PARTYID    <dbl> 3, 1, 1, 3, 3, 2, 4, 1, 1, 3, 2, 1, 1, 1, 1, 1, 3, 3, 1, 3,…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …

If we wanted to select observations based on a character or factor variable the syntax would be somewhat different because of the difference between numeric and character values. To select only respondents who supported Joe Biden using the preschoice variable, for example, we would use quotes to denote that the value we are filtering on is a character:

MI_final_small %>%
  filter(preschoice == "Joe Biden, the Democrat") %>%
  glimpse()
## Rows: 723
## Columns: 14
## $ SEX        <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2,…
## $ AGE10      <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 8, 1, 9, 10, 4, 1, 8, 1, 6, 9, 1, …
## $ PRSMI20    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ PARTYID    <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 1, 4, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1,…
## $ WEIGHT     <dbl> 0.4045421, 1.8052619, 0.8601966, 0.1991648, 0.1772090, 0.49…
## $ QRACEAI    <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,…
## $ EDUC18     <dbl> 4, 1, 5, 4, 5, 3, 3, 3, 4, 4, 1, 1, 2, 4, 2, 1, 2, 3, 2, 4,…
## $ LGBT       <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, 2, NA, 2, 2, 1, 2, N…
## $ BRNAGAIN   <dbl> NA, 1, 2, NA, NA, 2, 1, 2, NA, NA, NA, 2, NA, 2, 2, 2, 2, N…
## $ LATINOS    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ RACISM20   <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, 9, NA, 3, 2, 1, 2, N…
## $ QLT20      <fct> Has good judgment, NA, NA, Has good judgment, Cares about p…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality    <chr> "Has good judgment", NA, NA, "Has good judgment", "Cares ab…

Selecting and Filtering based on conditional statements

Many times we want to select observations (and/or variables) depending on whether or not several conditions are satisfied. To do so we use the logical operators: & means “and” (both conditions must hold) and | means “or” (at least one condition must hold).

So if we are considering condition A and condition B, filtering based on & will select observations for which both A and B are satisfied (i.e., the intersection) whereas | will select observations for which either A or B are satisfied (i.e., the union). Note that there is no limit to the number of conditions we can use and we can also combine them.
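To make the intersection/union distinction concrete, here is a small sketch with a toy tibble (the variable names mimic the exit poll coding, but the values are made up):

```r
library(dplyr)

# Toy data: SEX (1/2) and PARTYID (1 = Democrat, 2 = Republican, ...)
toy <- tibble(SEX     = c(1, 1, 2, 2),
              PARTYID = c(1, 2, 1, 2))

# & keeps rows where BOTH conditions hold (the intersection)
both <- toy %>% filter(SEX == 2 & PARTYID == 1)

# | keeps rows where EITHER condition holds (the union)
either <- toy %>% filter(SEX == 2 | PARTYID == 1)

nrow(both)   # 1
nrow(either) # 3
```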

SELF-TEST: If A = self-identified Democrat, B = self-identified Republican, and C = self-identified Independent, how would you interpret each of the conditions above in terms of which respondents would be filtered?

Let’s see how to implement this in R. According to the values coded in the Michigan Exit Poll, SEX takes the value 2 for female respondents and AGE10 takes the value 1 for the youngest age group (respondents under the age of 24).

To select the self-reported vote choices of females under the age of 24 we would want to select observations from individuals who are both female and under the age of 24. To do so, we use the following (note that we are selecting to focus on the most relevant variables as practice – the code also works without this step):

MI_final_small %>%
  select(SEX,AGE10,preschoice) %>%
    filter(SEX == 2 & AGE10 == 1) %>%
    glimpse()
## Rows: 17
## Columns: 3
## $ SEX        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ AGE10      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …

OK, but it is hard to know what to take away from this. How about pulling together content from above to produce the following?

MI_final_small %>%
    filter(SEX == 2 & AGE10 == 1) %>%
    count(preschoice) %>%
    arrange(desc(n))
## # A tibble: 3 × 2
##   preschoice                       n
##   <chr>                        <int>
## 1 Joe Biden, the Democrat         15
## 2 Another candidate                1
## 3 Donald Trump, the Republican     1

Note that here we have dropped the select step because the point of our code is to summarize the distribution of self-reported voting behavior, not to produce a new tibble for subsequent analysis. Because we are applying the count function to the variable preschoice, the piping will automatically select the relevant variable.

Notice the difference in the nature and dimensions of the resulting tibbles from these two code snippets. The first code snippet produced a tibble with 17 observations and 3 columns – it is a new data set we can analyze (if we had saved it!). But the second code snippet is a tibble that consists only of 3 observations and 2 columns because it is counting the number of times that each of the 3 unique values occur among the filtered set of respondents.

Quick Exercise Replicate the analysis using male respondents (i.e., SEX == 1). What do you observe about the number and distribution of vote choices?

# INSERT CODE HERE

In addition to selecting cases for which both conditions are satisfied, we can also select cases for which either condition A or condition B is satisfied. If we want to focus on voters who voted for either Biden or Trump – ignoring those who self-reported voting for some other candidate – we can filter the data accordingly:

MI_final_small %>%
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    glimpse()
## Rows: 1,182
## Columns: 14
## $ SEX        <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2,…
## $ AGE10      <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 6, 8, 9, 10, 1, 5, 9, 10, 8, 4, 1,…
## $ PRSMI20    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1,…
## $ PARTYID    <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1,…
## $ WEIGHT     <dbl> 0.4045421, 1.8052619, 0.8601966, 0.1991648, 0.1772090, 0.49…
## $ QRACEAI    <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 9, 1, 1, 1, 1, 1, 3, 1, 1, 1,…
## $ EDUC18     <dbl> 4, 1, 5, 4, 5, 3, 3, 3, 4, 4, 5, 5, 4, 1, 1, 1, 5, 2, 4, 2,…
## $ LGBT       <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, NA, NA, 2, NA, 2, 2,…
## $ BRNAGAIN   <dbl> NA, 1, 2, NA, NA, 2, 1, 2, NA, NA, NA, NA, NA, 2, NA, 2, 1,…
## $ LATINOS    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,…
## $ RACISM20   <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, NA, NA, 2, NA, 9, 4,…
## $ QLT20      <fct> Has good judgment, NA, NA, Has good judgment, Cares about p…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality    <chr> "Has good judgment", NA, NA, "Has good judgment", "Cares ab…

Note that this produces a new tibble, containing only respondents who report voting for either Biden or Trump, that could be used for subsequent analysis. Because the number of rows decreases from 1231 in MI_final_small to 1182 after the filter we have just performed, we can determine that 49 respondents reported voting for a candidate other than Biden or Trump (or gave no valid vote choice).
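That arithmetic can also be done in code by comparing row counts before and after the filter. A sketch with toy data (candidate labels shortened purely for illustration):

```r
library(dplyr)

toy <- tibble(preschoice = c("Biden", "Trump", "Biden", "Other", "Refused"))

major_party <- toy %>%
  filter(preschoice == "Biden" | preschoice == "Trump")

# Number of respondents dropped by the filter
nrow(toy) - nrow(major_party)
```

An equivalent filter is filter(preschoice %in% c("Biden", "Trump")), which avoids repeating the variable name for each value.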

To reiterate how piped functions can change the meaning of a tibble, consider how the size and content of the code we just ran compares to the following:

MI_final_small %>%
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    count(preschoice) %>%
    arrange(desc(n))
## # A tibble: 2 × 2
##   preschoice                       n
##   <chr>                        <int>
## 1 Joe Biden, the Democrat        723
## 2 Donald Trump, the Republican   459

Again, once we pipe (%>%) the filtered tibble through the count function, the tibble changes from being organized by observations (1182 x 14) to being organized by the unique values of the variable being counted (2 x 2). Moreover, the meaning of the columns changes from being variables associated with each observation to being the number of observations taking on each value.

Always know what your tibble looks like and what it contains!

NOTE: We can also use conditionals when selecting variables. Be careful with &, though: select(SEX & starts_with("P")) asks for columns that are named SEX and start with “P” – no column satisfies both, so zero columns are returned – whereas | selects columns satisfying either condition. For example:

MI_final %>%
  select(SEX & starts_with("P")) %>%
  glimpse()
## Rows: 1,231
## Columns: 0
MI_final %>%
  select(SEX | starts_with("P")) %>%
  glimpse()
## Rows: 1,231
## Columns: 4
## $ SEX     <hvn_lbl_> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, …
## $ PRSMI20 <hvn_lbl_> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, …
## $ PARTYID <hvn_lbl_> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, …
## $ PHIL3   <hvn_lbl_> 2, 2, 1, 9, 1, 2, 9, 2, 3, 2, 3, 3, 1, 3, 9, 2, 2, 1, 3, …

Combining filters and tibbles

Using piping (%>%), we count the variable preschoice within the tibble MI_final_small. This is the pattern we will follow, as we are going to pipe multiple commands together to accomplish our intended tasks.

MI_final_small %>% 
  count(preschoice)
## # A tibble: 6 × 2
##   preschoice                          n
##   <chr>                           <int>
## 1 Another candidate                  25
## 2 Donald Trump, the Republican      459
## 3 Joe Biden, the Democrat           723
## 4 Refused                            14
## 5 Undecided/Don’t know                4
## 6 Will/Did not vote for president     6

Note that here we are creating a table that is printed to the console window and then disappears.

If we wanted to save it for later we could assign this to a new object and then manipulate it.

MI_prescount <- MI_final_small %>% 
  count(preschoice)

Once saved, we can manipulate MI_prescount like any other tibble. Note that calling it directly reveals that it contains two variables – a variable of value labels with the same name as the variable that was counted (here preschoice) and a variable containing the number of observations associated with each value (here n).

MI_prescount
## # A tibble: 6 × 2
##   preschoice                          n
##   <chr>                           <int>
## 1 Another candidate                  25
## 2 Donald Trump, the Republican      459
## 3 Joe Biden, the Democrat           723
## 4 Refused                            14
## 5 Undecided/Don’t know                4
## 6 Will/Did not vote for president     6

You can then access this using all of the tools we have already talked about. For example, to get the number of Biden supporters we can select the relevant row using filter via

MI_prescount %>%
  filter(preschoice == "Joe Biden, the Democrat")
## # A tibble: 1 × 2
##   preschoice                  n
##   <chr>                   <int>
## 1 Joe Biden, the Democrat   723

Quick Exercise Extract the number of respondents who chose either Biden or Trump.

# INSERT CODE HERE

Consider what happens if we instead do:

MI_final_small %>% 
  select(preschoice) %>%
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1231

What did we do?!? This time we selected the variable preschoice and then counted up the total number of observations (rows). Before, we were counting the variable preschoice, which tallies the number of times each of its values occurs.

Missing Data

Several of our variables were only asked of half the sample (the 2020 Michigan exit poll used two different questionnaires to try to ask more questions in total) and it is important to account for that when thinking about the amount of data we have and what might be possible.

For example, let’s look at the quality respondents said mattered most in deciding their vote for a candidate.

MI_final_small %>%
  count(Quality)
## # A tibble: 6 × 2
##   Quality                             n
##   <chr>                           <int>
## 1 Can unite the country             125
## 2 Cares about people like me        121
## 3 Has good judgment                 205
## 4 Is a strong leader                138
## 5 [DON'T READ] Don’t know/refused    26
## 6 <NA>                              616

So here we see, based on the number of NA values, that nearly half of the respondents do not have a valid reply. We can use the drop_na function to remove missing data that R recognizes (more on this next lecture!). Note that the responses also contain an explicit value for missingness ([DON'T READ] Don’t know/refused), but we will deal with that later.

If we want to drop all of the missing observations in Quality we can use the following – note that drop_na is essentially applying a filter to remove missing data.

MI_final_small %>%
    drop_na(Quality) %>%
    count(Quality)
## # A tibble: 5 × 2
##   Quality                             n
##   <chr>                           <int>
## 1 Can unite the country             125
## 2 Cares about people like me        121
## 3 Has good judgment                 205
## 4 Is a strong leader                138
## 5 [DON'T READ] Don’t know/refused    26

We can include multiple variables in this list. Note that if we do not supply a list of variables, drop_na will default to the entire data set! Given that some questions were only asked of half of the sample, why does the following code produce a tibble with 0 rows?

MI_final_small %>%
    drop_na() %>%
    count(Quality)
## # A tibble: 0 × 2
## # ℹ 2 variables: Quality <chr>, n <int>

This highlights the importance of knowing what your data looks like in terms of missing data, and also why it can be important to first use select to keep only the variables of most interest. You do not want to remove data because of missingness in variables that you do not care about! As we will see, some functions have a built-in parameter to deal with missing data (e.g., mean) while others do not, so always know your data!
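For instance, mean() has an na.rm parameter that drops missing values before computing; without it, a single NA poisons the result:

```r
x <- c(1, 2, NA, 4)

mean(x)                # NA: the missing value propagates by default
mean(x, na.rm = TRUE)  # the NA is dropped before averaging
```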

Part II: Univariate Data Analysis (NBA Data)

Univariate is pretty much what it sounds like: one variable. When undertaking univariate data analysis, we need first and foremost to figure out what type of variable we’re working with. Once we do that, we can choose the appropriate use of the variable, either as an outcome or as a possible predictor.

Motivating Question

We’ll be working with data from every NBA player who was active during the 2018-19 season.

Here’s the data:

require(tidyverse)
nba<-read_rds("https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/nba_players_2018.Rds")

This data contains the following variables:

Codebook for NBA Data

Name Definition
namePlayer Player name
idPlayer Unique player id
slugSeason Season start and end
numberPlayerSeason Which season for this player
isRookie Rookie season, true or false
slugTeam Team short name
idTeam Unique team id
gp Games Played
gs Games Started
fgm Field goals made
fga Field goals attempted
pctFG Percent of field goals made
fg3m 3 point field goals made
fg3a 3 point field goals attempted
pctFG3 Percent of 3 point field goals made
pctFT Free Throw percentage
fg2m 2 point field goals made
fg2a 2 point field goals attempted
pctFG2 Percent of 2 point field goals made
agePlayer Player age
minutes Minutes played
ftm Free throws made
fta Free throws attempted
oreb Offensive rebounds
dreb Defensive rebounds
treb Total rebounds
ast Assists
stl Steals
blk Blocks
tov Turnovers
pf Personal fouls
pts Total points
urlNBAAPI Source url

We’re interested in the following questions:

To answer these questions we need to look at the following variables:

We’re going to go through a pretty standard set of steps for each variable. First, examine some cases. Second, based on our examination, we’ll try either a plot or a table. Once we’ve seen the plot or the table, we’ll think a bit about ordering, and then choose an appropriate measure of central tendency, and maybe variation.

Types of Variables

It’s really important to understand the types of variables you’re working with. Many times analysts neglect this step, particularly with larger datasets, and this can lead to a great deal of confusion down the road. Below are the variable types we’ll be working with this semester and the definition of each.

Continuous Variables

A continuous variable can theoretically be subdivided into arbitrarily small units and still be meaningful. You may have encountered further subdivision of continuous variables into “interval” or “ratio” data in other classes. We RARELY use these distinctions in practice. The distinction between a continuous and a categorical variable is hugely consequential, but the distinction between interval and ratio is not really all that important in practice.

The mean is the most widely used measure of central tendency for a continuous variable. If the distribution of the variable isn’t very symmetric or there are large outliers, then the median is a much better measure of central tendency.

Categorical Variables

A categorical variable divides the sample up into a set of mutually exclusive and exhaustive categories. Mutually exclusive means that each case can belong to only one category, and exhaustive means that the categories cover every possible option. Categorical is the “top” level classification for variables of this type; within the broad classification of categorical there are multiple more specific types.

Categorical: ordered

An ordered categorical variable has – you guessed it – some kind of sensible order that can be applied. For instance, the educational attainment of an individual (high school diploma, associates degree, bachelor’s degree, graduate degree) is an ordered categorical variable.

Ordered categorical variables should be arranged in the order of the variable, with proportions or percentages associated with each order. The mode, or the category with the highest proportion, is a reasonable measure of central tendency, but with fewer than ten categories the analyst should generally just show the proportion in each category.

Categorical: ordered, binary

An ordered binary variable has just two levels, but can be ordered. For instance, is a bird undertaking its first migration: yes or no? A “no” means that the bird has made more than one migration.

The mean of a binary variable coded as 0 or 1 is exactly the same thing as the proportion of the sample with that characteristic. So, the mean of a binary variable for “first migration,” where 1 = “yes,” will give the proportion of birds migrating for the first time.
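A quick sketch of this fact with made-up 0/1 data:

```r
# 1 = first migration, 0 = has migrated before (hypothetical data)
first_migration <- c(1, 0, 0, 1, 0)

# The mean of a 0/1 variable is the proportion of 1s
mean(first_migration)                                # 0.4
sum(first_migration == 1) / length(first_migration)  # same thing: 0.4
```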

Categorical: unordered

An unordered categorical variable has no sensible ordering that can be applied. Think about something like college major. There’s no “number” we might apply to philosophy that has any meaningful distance from a number we might apply to chemical engineering.

Unlike an ordered variable, an unordered categorical variable should be arranged in terms of the proportions falling into each of the categories. As with an ordered variable, it’s best just to show the proportions in each category for variables with fewer than ten levels. The mode is a reasonable single-variable summary of an unordered categorical variable.
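Here is a sketch of tabulating an unordered categorical variable with count and arrange, using hypothetical majors:

```r
library(dplyr)

majors <- tibble(major = c("Philosophy", "Chemistry", "Chemistry",
                           "History", "Chemistry", "History"))

major_tab <- majors %>%
  count(major) %>%           # tally each category
  arrange(desc(n)) %>%       # order by frequency, most common first
  mutate(prop = n / sum(n))  # proportion in each category

major_tab$major[1]  # the top row is the mode
```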

Categorical: unordered, binary

This kind of variable has no particular order, but can still be binary. A “1” means that the case has that characteristic; a “0” means the case does not. For instance, whether a tree is deciduous or not.

An unordered binary variable coded as 0 or 1 can also be summarized by the mean, which is the same thing as the proportion of the sample with that characteristic.

Formats for categorical variables

In R, categorical variables CAN be stored as text, numbers or even logicals. Don’t count on the data to help you out– you as the analyst need to figure this out.

Factors

We probably need to talk about factors. In R, a factor is a way of storing categorical variables. The factor provides additional information, including an ordering of the variable and a number assigned to each “level” of the factor. A categorical variable is a general term that’s understood across statistics. A factor variable is a specific R term. Most of the time it’s best not to have a categorical variable structured as a factor unless you know you want it to be a factor. More on this later . . .
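A minimal sketch of what a factor stores – an analyst-supplied ordering of levels plus a number for each level (the values here are hypothetical):

```r
# A character vector of education values
educ <- c("BA", "HS", "Grad", "HS", "AA")

# Converting to a factor with an explicit level ordering
educ_fct <- factor(educ, levels = c("HS", "AA", "BA", "Grad"))

levels(educ_fct)      # the ordering we supplied
as.numeric(educ_fct)  # the number stored for each level: 3 1 4 1 2
```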

The Process: #TrustTheProcess

I’m going to walk you through how an analyst might typically decide what type of variables they’re working with. It generally works like this:

  1. Take a look at a few observations and form a guess as to what type of variable it is.
  2. Based on that guess, create an appropriate plot or table.
  3. If the plot or table looks as expected, calculate some summary measures. If not, go back to 1.

“Glimpse” to start: what’s in here anyway?

The first thing we’re going to do with any dataset is just to take a quick look. We can call the data itself, but that will just show the first few cases and the first few variables. Far better is the glimpse command, which shows us all variables and the first few observations for all of the variables. Here’s a link to the codebook for this dataset:

The six variables we’re going to think about are field goals, free throw percentage, seasons played, rookie season, college attended, and conference played in.

glimpse(nba)
## Rows: 530
## Columns: 37
## $ namePlayer         <chr> "LaMarcus Aldridge", "Quincy Acy", "Steven Adams", …
## $ idPlayer           <dbl> 200746, 203112, 203500, 203518, 1628389, 1628959, 1…
## $ slugSeason         <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-1…
## $ numberPlayerSeason <dbl> 12, 6, 5, 2, 1, 0, 0, 0, 0, 0, 8, 5, 4, 3, 1, 1, 1,…
## $ isRookie           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE…
## $ slugTeam           <chr> "SAS", "PHX", "OKC", "OKC", "MIA", "CHI", "UTA", "C…
## $ idTeam             <dbl> 1610612759, 1610612756, 1610612760, 1610612760, 161…
## $ gp                 <dbl> 81, 10, 80, 31, 82, 10, 38, 19, 34, 7, 81, 72, 43, …
## $ gs                 <dbl> 81, 0, 80, 2, 28, 1, 2, 3, 1, 0, 81, 72, 40, 4, 80,…
## $ fgm                <dbl> 684, 4, 481, 56, 280, 13, 67, 11, 38, 3, 257, 721, …
## $ fga                <dbl> 1319, 18, 809, 157, 486, 39, 178, 36, 110, 10, 593,…
## $ pctFG              <dbl> 0.519, 0.222, 0.595, 0.357, 0.576, 0.333, 0.376, 0.…
## $ fg3m               <dbl> 10, 2, 0, 41, 3, 3, 32, 6, 25, 0, 96, 52, 9, 24, 6,…
## $ fg3a               <dbl> 42, 15, 2, 127, 15, 12, 99, 23, 74, 4, 280, 203, 34…
## $ pctFG3             <dbl> 0.2380952, 0.1333333, 0.0000000, 0.3228346, 0.20000…
## $ pctFT              <dbl> 0.847, 0.700, 0.500, 0.923, 0.735, 0.667, 0.750, 1.…
## $ fg2m               <dbl> 674, 2, 481, 15, 277, 10, 35, 5, 13, 3, 161, 669, 1…
## $ fg2a               <dbl> 1277, 3, 807, 30, 471, 27, 79, 13, 36, 6, 313, 1044…
## $ pctFG2             <dbl> 0.5277995, 0.6666667, 0.5960347, 0.5000000, 0.58811…
## $ agePlayer          <dbl> 33, 28, 25, 25, 21, 21, 23, 22, 23, 26, 28, 24, 25,…
## $ minutes            <dbl> 2687, 123, 2669, 588, 1913, 120, 416, 194, 428, 22,…
## $ ftm                <dbl> 349, 7, 146, 12, 166, 8, 45, 4, 7, 1, 150, 500, 37,…
## $ fta                <dbl> 412, 10, 292, 13, 226, 12, 60, 4, 9, 2, 173, 686, 6…
## $ oreb               <dbl> 251, 3, 391, 5, 165, 11, 3, 3, 11, 1, 112, 159, 48,…
## $ dreb               <dbl> 493, 22, 369, 43, 432, 15, 20, 16, 49, 3, 498, 739,…
## $ treb               <dbl> 744, 25, 760, 48, 597, 26, 23, 19, 60, 4, 610, 898,…
## $ ast                <dbl> 194, 8, 124, 20, 184, 13, 25, 5, 65, 6, 104, 424, 1…
## $ stl                <dbl> 43, 1, 117, 17, 71, 1, 6, 1, 14, 2, 68, 92, 54, 22,…
## $ blk                <dbl> 107, 4, 76, 6, 65, 0, 6, 4, 5, 0, 33, 110, 37, 13, …
## $ tov                <dbl> 144, 4, 135, 14, 121, 8, 33, 6, 28, 2, 72, 268, 58,…
## $ pf                 <dbl> 179, 24, 204, 53, 203, 7, 47, 13, 45, 4, 143, 232, …
## $ pts                <dbl> 1727, 17, 1108, 165, 729, 37, 211, 32, 108, 7, 760,…
## $ urlNBAAPI          <chr> "https://stats.nba.com/stats/playercareerstats?Leag…
## $ n                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ org                <fct> Texas, NA, Other, FC Barcelona Basquet, Kentucky, N…
## $ country            <chr> NA, NA, NA, "Spain", NA, NA, NA, NA, NA, NA, NA, "S…
## $ idConference       <int> 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, …

Continuous

Let’s start by taking a look at field goals. It seems pretty likely that this is a continuous variable. Let’s take a look at the top 50 spots.

nba%>% ## Start with the dataset
  select(namePlayer,slugTeam,fgm)%>% ## and then select a few variables
  arrange(-fgm)%>% ## arrange in reverse order of field goals
  print(n=50) ## print out the top 50
## # A tibble: 530 × 3
##    namePlayer            slugTeam   fgm
##    <chr>                 <chr>    <dbl>
##  1 James Harden          HOU        843
##  2 Bradley Beal          WAS        764
##  3 Kemba Walker          CHA        731
##  4 Giannis Antetokounmpo MIL        721
##  5 Kevin Durant          GSW        721
##  6 Paul George           OKC        707
##  7 Nikola Vucevic        ORL        701
##  8 LaMarcus Aldridge     SAS        684
##  9 Damian Lillard        POR        681
## 10 Karl-Anthony Towns    MIN        681
## 11 Donovan Mitchell      UTA        661
## 12 D'Angelo Russell      BKN        659
## 13 Klay Thompson         GSW        655
## 14 Stephen Curry         GSW        632
## 15 DeMar DeRozan         SAS        631
## 16 Russell Westbrook     OKC        630
## 17 Buddy Hield           SAC        623
## 18 Blake Griffin         DET        619
## 19 Nikola Jokic          DEN        616
## 20 Tobias Harris         MIN        611
## 21 Kyrie Irving          BOS        604
## 22 Devin Booker          PHX        586
## 23 Joel Embiid           PHI        580
## 24 CJ McCollum           POR        571
## 25 Julius Randle         NOP        571
## 26 Andre Drummond        DET        561
## 27 Kawhi Leonard         TOR        560
## 28 LeBron James          LAL        558
## 29 Jrue Holiday          NOP        547
## 30 Montrezl Harrell      LAC        546
## 31 Ben Simmons           PHI        540
## 32 Anthony Davis         NOP        530
## 33 Zach LaVine           CHI        530
## 34 Jordan Clarkson       CLE        529
## 35 Trae Young            ATL        525
## 36 Bojan Bogdanovic      IND        522
## 37 Pascal Siakam         TOR        519
## 38 Collin Sexton         CLE        519
## 39 Jamal Murray          DEN        513
## 40 Deandre Ayton         PHX        509
## 41 Luka Doncic           DAL        506
## 42 Khris Middleton       MIL        506
## 43 De'Aaron Fox          SAC        505
## 44 Andrew Wiggins        MIN        498
## 45 Kyle Kuzma            LAL        496
## 46 Mike Conley           MEM        490
## 47 Lou Williams          LAC        484
## 48 Steven Adams          OKC        481
## 49 Rudy Gobert           UTA        476
## 50 Clint Capela          HOU        474
## # ℹ 480 more rows

So what I’m seeing here is that field goals aren’t “clumped” at certain levels. Let’s confirm that by looking at a kernel density plot.

nba%>%
  ggplot(aes(x=fgm))+
  geom_density()

We can also use a histogram to figure out much the same thing.

nba%>%
  ggplot(aes(x=fgm))+
  geom_histogram()

Now, technically field goals don't meet the definition I set out above for a continuous variable, because they aren't divisible below a certain amount. In practice, though, we usually just ignore this: the variable is "as good as" continuous, given that it varies smoothly over its range and isn't confined to a relatively small set of possible values.
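One quick heuristic for making this call (a sketch using made-up vectors, not the real nba columns) is to count distinct values: a variable with hundreds of distinct values spread over its range is "as good as" continuous, while one confined to a handful of values is better treated as categorical.

```r
library(dplyr) # for n_distinct()

# Made-up stand-ins for fgm and numberPlayerSeason
set.seed(1234)
fgm_like     <- sample(0:900, 500, replace = TRUE) # wide range, many possible values
seasons_like <- sample(0:20,  500, replace = TRUE) # only 21 possible values

n_distinct(fgm_like)     # hundreds of distinct values: treat as continuous
n_distinct(seasons_like) # 21 or fewer: treat as categorical
```

This is only a rule of thumb, but it matches what the density plot and histogram above are showing us visually.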

Quick Exercise: Do the same thing for field goal percentage and think about what kind of variable it is.

# INSERT CODE HERE

Measures for Continuous Variables

The mean is used most of the time for continuous variables, but it’s VERY sensitive to outliers. The median (50th percentile) is usually better, but it can be difficult to explain to general audiences.
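To see the outlier sensitivity concretely, here's a toy example (made-up numbers, not the nba data): adding a single extreme value drags the mean far from the typical values, while the median barely budges.

```r
x <- c(10, 12, 13, 14, 15)
mean(x)   # 12.8
median(x) # 13

x_out <- c(x, 500) # one extreme outlier
mean(x_out)   # 94 -- pulled way up by the single outlier
median(x_out) # 13.5 -- barely moved
```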

nba%>%
  summarize(mean_fgm=mean(fgm))
## # A tibble: 1 × 1
##   mean_fgm
##      <dbl>
## 1     191.
nba%>%
  summarize(median_fgm=median(fgm))
## # A tibble: 1 × 1
##   median_fgm
##        <dbl>
## 1        157

In this case I’d really prefer the mean as a single measure of field goal production, but depending on the audience I still might just go ahead and use the median.

Quick Exercise What measure would you prefer for field goal percentage? Calculate that measure.

# INSERT CODE HERE

Categorical: ordered

Let’s take a look at player seasons.

nba%>%
  select(namePlayer,numberPlayerSeason)%>%
  arrange(-numberPlayerSeason)%>%
  print(n=50)
## # A tibble: 530 × 2
##    namePlayer        numberPlayerSeason
##    <chr>                          <dbl>
##  1 Vince Carter                      20
##  2 Dirk Nowitzki                     20
##  3 Jamal Crawford                    18
##  4 Tony Parker                       17
##  5 Tyson Chandler                    17
##  6 Pau Gasol                         17
##  7 Nene                              16
##  8 Carmelo Anthony                   15
##  9 Udonis Haslem                     15
## 10 LeBron James                      15
## 11 Zaza Pachulia                     15
## 12 Dwyane Wade                       15
## 13 Kyle Korver                       15
## 14 Luol Deng                         14
## 15 Devin Harris                      14
## 16 Dwight Howard                     14
## 17 Andre Iguodala                    14
## 18 JR Smith                          14
## 19 Trevor Ariza                      14
## 20 Andrew Bogut                      13
## 21 Jose Calderon                     13
## 22 Raymond Felton                    13
## 23 Amir Johnson                      13
## 24 Shaun Livingston                  13
## 25 Chris Paul                        13
## 26 Marvin Williams                   13
## 27 Lou Williams                      13
## 28 CJ Miles                          13
## 29 LaMarcus Aldridge                 12
## 30 J.J. Barea                        12
## 31 Channing Frye                     12
## 32 Rudy Gay                          12
## 33 Kyle Lowry                        12
## 34 Paul Millsap                      12
## 35 JJ Redick                         12
## 36 Rajon Rondo                       12
## 37 Thabo Sefolosha                   12
## 38 Marco Belinelli                   11
## 39 Mike Conley                       11
## 40 Kevin Durant                      11
## 41 Jared Dudley                      11
## 42 Marcin Gortat                     11
## 43 Gerald Green                      11
## 44 Al Horford                        11
## 45 Joakim Noah                       11
## 46 Thaddeus Young                    11
## 47 Nick Young                        11
## 48 Corey Brewer                      11
## 49 D.J. Augustin                     10
## 50 Jerryd Bayless                    10
## # ℹ 480 more rows

Looks like it might be continuous? Let’s plot it:

nba%>%
  ggplot(aes(x=numberPlayerSeason))+
  geom_histogram(binwidth = 1)

Nope. See how it falls into a small set of possible categories? This is an ordered categorical variable, which means we should calculate the proportion in each category:

nba%>%
  group_by(numberPlayerSeason)%>%
  count(name="total_in_group")%>%
  ungroup()%>%
  mutate(proportion=total_in_group/sum(total_in_group))
## # A tibble: 20 × 3
##    numberPlayerSeason total_in_group proportion
##                 <dbl>          <int>      <dbl>
##  1                  0            105    0.198  
##  2                  1             89    0.168  
##  3                  2             56    0.106  
##  4                  3             43    0.0811 
##  5                  4             37    0.0698 
##  6                  5             33    0.0623 
##  7                  6             31    0.0585 
##  8                  7             25    0.0472 
##  9                  8             19    0.0358 
## 10                  9             20    0.0377 
## 11                 10             24    0.0453 
## 12                 11             11    0.0208 
## 13                 12              9    0.0170 
## 14                 13              9    0.0170 
## 15                 14              6    0.0113 
## 16                 15              6    0.0113 
## 17                 16              1    0.00189
## 18                 17              3    0.00566
## 19                 18              1    0.00189
## 20                 20              2    0.00377

What does this tell us?

Quick Exercise Create a histogram for player age. What does that tell us about the NBA?

# INSERT CODE HERE

Categorical: ordered, binary

Let’s take a look at the variable for Rookie season.

nba%>%select(namePlayer,isRookie)
## # A tibble: 530 × 2
##    namePlayer             isRookie
##    <chr>                  <lgl>   
##  1 LaMarcus Aldridge      FALSE   
##  2 Quincy Acy             FALSE   
##  3 Steven Adams           FALSE   
##  4 Alex Abrines           FALSE   
##  5 Bam Adebayo            FALSE   
##  6 Rawle Alkins           TRUE    
##  7 Grayson Allen          TRUE    
##  8 Deng Adel              TRUE    
##  9 Jaylen Adams           TRUE    
## 10 DeVaughn Akoon-Purcell TRUE    
## # ℹ 520 more rows

Okay, so that’s set to a logical. In R, TRUE or FALSE are special values that indicate the result of a logical question. In this it’s whether or not the player is a rookie.

Usually we want a binary variable to have at least one version coded so that 1 = TRUE and 0 = FALSE. This makes data analysis much easier. Let's do that with this variable.

This code uses ifelse to create a new variable called isRookie_bin that's set to 1 if the isRookie variable is TRUE, and 0 otherwise.

nba<-nba%>%
  mutate(isRookie_bin=ifelse(isRookie==TRUE,1,0))

Now that it’s coded 0,1 we can calculate the mean, which is the same thing as the proportion of the players that are rookies.

nba%>%summarize(mean=mean(isRookie_bin))
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.198
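That 0.198 matches the proportion of players in their first season that we saw in the table above. The general fact, shown here on toy data: the mean of a 0/1 variable is the number of 1s divided by the number of observations, which is exactly the proportion of 1s.

```r
flags <- c(1, 0, 0, 1, 0)  # two 1s out of five observations
mean(flags)                # 0.4
sum(flags) / length(flags) # the same calculation written out: 0.4
```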

Categorical: unordered

Let’s take a look at which college a player attended, which is a good example of an unordered categorical variable. The org variable tells us which organization the player was in before playing in the NBA.

nba%>%
  select(org)%>%
  glimpse()
## Rows: 530
## Columns: 1
## $ org <fct> Texas, NA, Other, FC Barcelona Basquet, Kentucky, NA, Duke, NA, NA…

These look like team or college names, so this is a categorical variable. Let's take a look at the counts of players from different organizations:

nba%>%
  group_by(org)%>%
  count()%>%
  arrange(-n)%>%
  print(n=50)
## # A tibble: 68 × 2
## # Groups:   org [68]
##    org                        n
##    <fct>                  <int>
##  1 <NA>                     157
##  2 Other                     85
##  3 Kentucky                  25
##  4 Duke                      17
##  5 California-Los Angeles    15
##  6 Kansas                    11
##  7 Arizona                   10
##  8 Texas                     10
##  9 North Carolina             9
## 10 Michigan                   8
## 11 Villanova                  7
## 12 Indiana                    6
## 13 Southern California        6
## 14 Syracuse                   6
## 15 California                 5
## 16 Louisville                 5
## 17 Ohio State                 5
## 18 Wake Forest                5
## 19 Colorado                   4
## 20 Connecticut                4
## 21 Creighton                  4
## 22 FC Barcelona Basquet       4
## 23 Florida                    4
## 24 Georgia Tech               4
## 25 Michigan State             4
## 26 Oregon                     4
## 27 Utah                       4
## 28 Washington                 4
## 29 Wisconsin                  4
## 30 Boston College             3
## 31 Florida State              3
## 32 Georgetown                 3
## 33 Gonzaga                    3
## 34 Iowa State                 3
## 35 Marquette                  3
## 36 Maryland                   3
## 37 Miami (FL)                 3
## 38 North Carolina State       3
## 39 Notre Dame                 3
## 40 Oklahoma                   3
## 41 Purdue                     3
## 42 Southern Methodist         3
## 43 Stanford                   3
## 44 Tennessee                  3
## 45 Virginia                   3
## 46 Anadolu Efes S.K.          2
## 47 Baylor                     2
## 48 Butler                     2
## 49 Cincinnati                 2
## 50 Kansas State               2
## # ℹ 18 more rows

Here we have a problem. If we're interested just in colleges, we're going to need to structure this a bit more. The code below filters out three categories that we don't want: missing data, anything classified as "Other", and sports teams from other countries. The last filter is incomplete: I probably missed some! If I were doing this for real, I would use a list of colleges and include only those names.

What I do below is negate str_detect by placing the ! in front of it. This means I want all of the cases that don't match the pattern I supplied. The pattern makes heavy use of the OR operator |. I'm saying I don't want to include players whose organization name includes the letters CB or KK and so on (these are common prefixes for sports organizations in other countries, which I definitely did not look up on Wikipedia. Ok, I did.).
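Here's a quick toy illustration of the negated str_detect (made-up org names and a shortened version of the pattern):

```r
library(stringr)

# Hypothetical org names to show how the pattern works
orgs <- c("Kentucky", "KK Partizan", "FC Barcelona Basquet", "Duke")

str_detect(orgs, "CB|KK|FC")  # FALSE TRUE TRUE FALSE: TRUE if any listed prefix matches
!str_detect(orgs, "CB|KK|FC") # TRUE FALSE FALSE TRUE: what filter() would keep

orgs[!str_detect(orgs, "CB|KK|FC")] # "Kentucky" "Duke"
```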

nba%>%
  filter(!is.na(org))%>%
  filter(!org=="Other")%>%
  filter(!str_detect(org,"CB|KK|rytas|FC|B.C.|S.K.|Madrid"))%>%
  group_by(org)%>%
  count()%>%
  arrange(-n)%>%
  print(n=50)
## # A tibble: 57 × 2
## # Groups:   org [57]
##    org                        n
##    <fct>                  <int>
##  1 Kentucky                  25
##  2 Duke                      17
##  3 California-Los Angeles    15
##  4 Kansas                    11
##  5 Arizona                   10
##  6 Texas                     10
##  7 North Carolina             9
##  8 Michigan                   8
##  9 Villanova                  7
## 10 Indiana                    6
## 11 Southern California        6
## 12 Syracuse                   6
## 13 California                 5
## 14 Louisville                 5
## 15 Ohio State                 5
## 16 Wake Forest                5
## 17 Colorado                   4
## 18 Connecticut                4
## 19 Creighton                  4
## 20 Florida                    4
## 21 Georgia Tech               4
## 22 Michigan State             4
## 23 Oregon                     4
## 24 Utah                       4
## 25 Washington                 4
## 26 Wisconsin                  4
## 27 Boston College             3
## 28 Florida State              3
## 29 Georgetown                 3
## 30 Gonzaga                    3
## 31 Iowa State                 3
## 32 Marquette                  3
## 33 Maryland                   3
## 34 Miami (FL)                 3
## 35 North Carolina State       3
## 36 Notre Dame                 3
## 37 Oklahoma                   3
## 38 Purdue                     3
## 39 Southern Methodist         3
## 40 Stanford                   3
## 41 Tennessee                  3
## 42 Virginia                   3
## 43 Baylor                     2
## 44 Butler                     2
## 45 Cincinnati                 2
## 46 Kansas State               2
## 47 Louisiana State            2
## 48 Memphis                    2
## 49 Missouri                   2
## 50 Murray State               2
## # ℹ 7 more rows

That looks better. Which are the most common colleges and universities that send players to the NBA?

Quick Exercise Arrange the number of players by team in descending order.

# INSERT CODE HERE

Categorical: unordered, binary

There are two conferences in the NBA, Eastern and Western. Let's take a look at the variable that indicates which conference the player played in that season.

nba%>%select(idConference)%>%
  glimpse()
## Rows: 530
## Columns: 1
## $ idConference <int> 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, …

It looks like conference is stored as a number, taking the value 1 or 2. Because it's best to have binary variables structured as "has the characteristic" (1) or "doesn't have the characteristic" (0), we're going to create a variable for western conference that's set to 1 if the player was playing in the Western conference and 0 if not (which is the same as playing in the Eastern conference).

nba<-nba%>%
  mutate(west_conference=ifelse(idConference==1,1,0))

Once we’ve done that, we can see how many players played in each conference.

nba%>%
  summarize(mean(west_conference))
## # A tibble: 1 × 1
##   `mean(west_conference)`
##                     <dbl>
## 1                   0.508

Makes sense!

Quick Exercise: create a variable for whether or not the player is from the USA. Calculate the proportion of players in the NBA who are from the USA. The coding on country is … decidedly US-centric, so you'll need to think about this one a bit.

# INSERT CODE HERE

Analysis

Ok, now that we know how this works, we can do some summary analysis. First of all, what does the total number of field goals made look like by college?

We know that field goals are continuous (sort of), so we can summarize them within groups. We know that college is a categorical variable, so we'll use it to group the data. Summarizing a continuous variable within groups defined by a categorical variable is one of our first examples of a conditional mean, which we'll use a lot.
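The logic of a conditional mean can be sketched on a tiny made-up tibble (hypothetical colleges and field goal counts, not the nba data): group by the categorical variable, then summarize the continuous one within each group.

```r
library(tidyverse)

toy <- tibble(college = c("A", "A", "B", "B", "B"),
              fg      = c(100, 200, 300, 400, 500))

toy %>%
  group_by(college) %>%
  summarize(mean_fg = mean(fg))
# college A: mean of 100 and 200 is 150
# college B: mean of 300, 400, 500 is 400
```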

Top 50 Colleges by Total FG

nba%>%
  filter(!is.na(org))%>%
  filter(!org=="Other")%>%
  filter(!str_detect(org,"CB|KK|rytas|FC|B.C.|S.K.|Madrid"))%>%
  group_by(org)%>%
  summarize(total_fg=sum(fgm))%>%
  arrange(-total_fg)%>%
  print(n=50)
## # A tibble: 57 × 2
##    org                   total_fg
##    <fct>                    <dbl>
##  1 Kentucky                  6594
##  2 Duke                      4623
##  3 Texas                     3437
##  4 California-Los Angeles    3382
##  5 Kansas                    2765
##  6 Arizona                   2101
##  7 Oklahoma                  1767
##  8 Southern California       1758
##  9 Louisville                1679
## 10 North Carolina            1659
## 11 Indiana                   1522
## 12 Ohio State                1486
## 13 Michigan                  1392
## 14 Wake Forest               1364
## 15 Connecticut               1299
## 16 Villanova                 1222
## 17 Georgia Tech              1169
## 18 Tennessee                 1095
## 19 Stanford                   949
## 20 Utah                       943
## 21 Marquette                  873
## 22 Gonzaga                    863
## 23 Michigan State             820
## 24 Colorado                   818
## 25 Virginia                   816
## 26 Maryland                   811
## 27 Missouri                   756
## 28 California                 734
## 29 Florida State              733
## 30 Georgetown                 717
## 31 Memphis                    620
## 32 Florida                    618
## 33 North Carolina State       598
## 34 Boston College             586
## 35 Louisiana State            583
## 36 Syracuse                   567
## 37 Iowa State                 523
## 38 Butler                     459
## 39 Wisconsin                  456
## 40 Creighton                  432
## 41 Oregon                     352
## 42 Texas A&M                  322
## 43 Baylor                     312
## 44 Providence                 291
## 45 Purdue                     275
## 46 Notre Dame                 263
## 47 Ulkerspor                  252
## 48 Southern Methodist         246
## 49 Oklahoma State             242
## 50 West Virginia              236
## # ℹ 7 more rows

Next, what about free throw percentage?

Top 50 Colleges by Average Free Throw Percentage

nba%>%
  filter(!is.na(org))%>%
  filter(!org=="Other")%>%
  filter(!str_detect(org,"CB|KK|rytas|FC|B.C.|S.K.|Madrid"))%>%
  group_by(org)%>%
  summarize(mean_ftp=mean(pctFT))%>%
  arrange(-mean_ftp)%>%
  print(n=50)
## # A tibble: 57 × 2
##    org                  mean_ftp
##    <fct>                   <dbl>
##  1 Tennessee               0.842
##  2 Virginia                0.833
##  3 Oklahoma                0.823
##  4 North Carolina State    0.817
##  5 West Virginia           0.804
##  6 Ulkerspor               0.803
##  7 Missouri                0.802
##  8 Wake Forest             0.802
##  9 Florida State           0.801
## 10 Murray State            0.798
## 11 Iowa State              0.795
## 12 Notre Dame              0.792
## 13 Memphis                 0.788
## 14 Florida                 0.784
## 15 Michigan                0.783
## 16 Stanford                0.779
## 17 Georgetown              0.775
## 18 Marquette               0.774
## 19 Utah                    0.770
## 20 Kansas State            0.767
## 21 Butler                  0.762
## 22 Gonzaga                 0.761
## 23 North Carolina          0.756
## 24 Villanova               0.755
## 25 Texas                   0.752
## 26 Connecticut             0.748
## 27 Providence              0.747
## 28 Boston College          0.742
## 29 Michigan State          0.730
## 30 Kansas                  0.729
## 31 Indiana                 0.729
## 32 Duke                    0.728
## 33 Baylor                  0.726
## 34 Arizona                 0.721
## 35 Pallacanestro Biella    0.718
## 36 Wisconsin               0.712
## 37 Kentucky                0.712
## 38 Georgia Tech            0.712
## 39 Louisiana State         0.709
## 40 Creighton               0.698
## 41 Maryland                0.695
## 42 Vanderbilt              0.688
## 43 Washington              0.680
## 44 Louisville              0.679
## 45 Ohio State              0.679
## 46 California              0.675
## 47 Southern Methodist      0.673
## 48 Oregon                  0.662
## 49 Texas A&M               0.652
## 50 Southern California     0.648
## # ℹ 7 more rows

Quick Exercise Calculate field goals made by player season.

# INSERT CODE HERE

Quick Exercise Calculate free throw percentage by player season.

# INSERT CODE HERE

  1. Note that in this instance we could also use arrange(-n) to sort in descending order. We use desc because it is slightly more generic (i.e., if we tried to use arrange(-SEX) above it would not work).↩︎